There are 2 packages you will need to install for today’s practical: install.packages(c("h2o", "eegkit", "forecast", "tseries") apart from that everything else should already be available on your system. However, I will endeavour to use explicit imports to make it clear where functions are coming from (functions without library_name:: are part of base R or a function we’ve defined in this notebook).

knitr::opts_chunk$set(echo = TRUE)

# experimenting with this ML library on my quest to find something pleasant to use in R
library(h2o)
## 
## ----------------------------------------------------------------------
## 
## Your next step is to start H2O:
##     > h2o.init()
## 
## For H2O package documentation, ask for help:
##     > ??h2o
## 
## After starting H2O, you can use the Web UI at http://localhost:54321
## For more information visit https://docs.h2o.ai
## 
## ----------------------------------------------------------------------
## 
## Attaching package: 'h2o'
## The following objects are masked from 'package:stats':
## 
##     cor, sd, var
## The following objects are masked from 'package:base':
## 
##     &&, %*%, %in%, ||, apply, as.factor, as.numeric, colnames,
##     colnames<-, ifelse, is.character, is.factor, is.numeric, log,
##     log10, log1p, log2, round, signif, trunc
h2o::h2o.init(nthreads = 1)
##  Connection successful!
## 
## R is connected to the H2O cluster: 
##     H2O cluster uptime:         1 days 34 minutes 
##     H2O cluster timezone:       America/Halifax 
##     H2O data parsing timezone:  UTC 
##     H2O cluster version:        3.36.1.2 
##     H2O cluster version age:    17 days  
##     H2O cluster name:           H2O_started_from_R_fin_smu226 
##     H2O cluster total nodes:    1 
##     H2O cluster total memory:   3.38 GB 
##     H2O cluster total cores:    4 
##     H2O cluster allowed cores:  1 
##     H2O cluster healthy:        TRUE 
##     H2O Connection ip:          localhost 
##     H2O Connection port:        54321 
##     H2O Connection proxy:       NA 
##     H2O Internal Security:      FALSE 
##     R Version:                  R version 4.1.3 (2022-03-10)
# EEG manipulation library in R (although very limited compared to signal processing libraries available in other languages, matlab might actually still be a leader in this specific area)
library(eegkit)
## Loading required package: eegkitdata
## Loading required package: bigsplines
## Loading required package: quadprog
## Loading required package: ica
## Loading required package: rgl
## Loading required package: signal
## 
## Attaching package: 'signal'
## The following objects are masked from 'package:stats':
## 
##     filter, poly
# some time series functions (that we only skim the depths of)
library(forecast)
## Registered S3 method overwritten by 'quantmod':
##   method            from
##   as.zoo.data.frame zoo
library(tseries)

# just tidyverse libraries that should already be installed
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following object is masked from 'package:signal':
## 
##     filter
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(reshape2)
library(purrr)
library(ggplot2)

EEG Eye Detection Data

One of the most common types of medical sensor data (and one that we talked about during the lecture) are Electroencephalograms (EEGs).
These measure mesoscale electrical signals (measured in microvolts) within the brain, which are indicative of a region of neuronal activity. Typically, EEGs involve an array of sensors (aka channels) placed on the scalp with a high degree of covariance between sensors.

As EEG data can be very large and unwieldy, we are going to use a relatively small/simple dataset today from this paper.

This dataset is a 117 second continuous EEG measurement collected from a single person with a device called a “Emotiv EEG Neuroheadset”. In combination with the EEG data collection, a camera was used to record whether person being recorded had their eyes open or closed. This was eye status was then manually annotated onto the EEG data with 1 indicated the eyes being closed and 0 the eyes being open. Measures microvoltages are listed in chronological order with the first measured value at the top of the dataframe.

Let’s parse the data directly from h2o’s test data S3 bucket:

eeg_url <- "https://h2o-public-test-data.s3.amazonaws.com/smalldata/eeg/eeg_eyestate_splits.csv"
eeg_data <- dplyr::as_tibble(h2o::h2o.importFile(eeg_url))
## 
  |                                                                            
  |                                                                      |   0%
  |                                                                            
  |=============                                                         |  19%
  |                                                                            
  |================================================                      |  69%
  |                                                                            
  |======================================================================| 100%
# add timestamp
Fs <- 117 / dim(eeg_data)[1]
eeg_data <- eeg_data %>% dplyr::mutate(ds=seq(0, 116.99999, by=Fs), eyeDetection=as.factor(eyeDetection))
print(eeg_data %>% dplyr::group_by(eyeDetection) %>% dplyr::count())
## # A tibble: 2 × 2
## # Groups:   eyeDetection [2]
##   eyeDetection     n
##   <fct>        <int>
## 1 0             8257
## 2 1             6723
# split dataset into train, validate, test
eeg_train <- eeg_data %>% dplyr::filter(split=='train') %>% dplyr::select(-split)
print(eeg_train %>% dplyr::group_by(eyeDetection) %>% dplyr::count())
## # A tibble: 2 × 2
## # Groups:   eyeDetection [2]
##   eyeDetection     n
##   <fct>        <int>
## 1 0             4916
## 2 1             4072
eeg_validate <- h2o::as.h2o(eeg_data %>% dplyr::filter(split=='valid') %>% dplyr::select(-split))
## 
  |                                                                            
  |                                                                      |   0%
  |                                                                            
  |======================================================================| 100%
eeg_test <- h2o::as.h2o(eeg_data %>% dplyr::filter(split=='test') %>% dplyr::select(-split))
## 
  |                                                                            
  |                                                                      |   0%
  |                                                                            
  |======================================================================| 100%

0 Knowing the eeg_data contains 117 seconds of data, inspect the eeg_data dataframe and work out approximately how many samples per second were taken?

1 How many EEG electrodes/sensors were used?

Exploratory Data Analysis

Now that we have the dataset and some basic parameters let’s begin with the ever important/relevant exploratory data analysis.

First we should check there is no missing data!

h2o::h2o.nacnt(h2o::as.h2o(eeg_data))
## 
  |                                                                            
  |                                                                      |   0%
  |                                                                            
  |======================================================================| 100%
##  [1] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

Great, now we can start generating some plots to look at this data within the time-domain.

melt <- reshape2::melt(eeg_data %>% dplyr::select(-split), id.vars=c("eyeDetection", "ds"), variable.name = "Electrode", value.name = "microvolts")


ggplot2::ggplot(melt, ggplot2::aes(x=ds, y=microvolts, color=Electrode)) + 
  ggplot2::geom_line() + 
  ggplot2::ylim(3500,5000) + 
  ggplot2::geom_vline(ggplot2::aes(xintercept=ds), data=dplyr::filter(melt, eyeDetection==1), alpha=0.005)

2 Do you see any obvious patterns between eyes being open (dark grey blocks in the plot) and the EEG intensities?

3 Similarly, based on the distribution of eye open/close state over time to anticipate any temporal correlation between these states?

Let’s see if we can directly look at the distribution of EEG intensities and see how they related to eye status.

melt_train <- reshape2::melt(eeg_train, id.vars=c("eyeDetection", "ds"), variable.name = "Electrode", value.name = "microvolts")

# filter huge outliers in voltage
filt_melt_train <- dplyr::filter(melt_train, microvolts %in% (3750:5000)) %>% dplyr::mutate(eyeDetection=as.factor(eyeDetection))

ggplot2::ggplot(filt_melt_train, ggplot2::aes(y=Electrode, x=microvolts, fill=eyeDetection)) + ggplot2::geom_boxplot()

Plots are great but sometimes so it is also useful to directly look at the summary statistics and how they related to eye status:

filt_melt_train %>% dplyr::group_by(eyeDetection, Electrode) %>% 
    dplyr::summarise(mean = mean(microvolts), median=median(microvolts), sd=sd(microvolts)) %>% 
    dplyr::arrange(Electrode)
## `summarise()` has grouped output by 'eyeDetection'. You can override using the
## `.groups` argument.
## # A tibble: 28 × 5
## # Groups:   eyeDetection [2]
##    eyeDetection Electrode  mean median    sd
##    <fct>        <fct>     <dbl>  <dbl> <dbl>
##  1 0            AF3       4294.   4300  35.4
##  2 1            AF3       4305.   4300  34.4
##  3 0            F7        4015.   4020  28.4
##  4 1            F7        4007.   4000  24.9
##  5 0            F3        4268.   4260  20.9
##  6 1            F3        4269.   4260  17.4
##  7 0            FC5       4124.   4120  17.3
##  8 1            FC5       4124.   4120  19.2
##  9 0            T7        4341.   4340  13.9
## 10 1            T7        4342.   4340  15.5
## # … with 18 more rows

4 Based on these analyses are any electrodes consistently more intense or varied when eyes are open?

Frequency-Space

We can also explore the data in frequency space by using a Fast Fourier Transform.
After the FFT we can summarise the distributions of frequencies by their density across the power spectrum. This will let us see if there any obvious patterns related to eye status in the overall frequency distributions.

eegkit::eegpsd(eeg_train %>% dplyr::filter(eyeDetection == 0) %>% dplyr::select(-eyeDetection, -ds), Fs = Fs, xlab="Eye Open")

eegkit::eegpsd(eeg_train %>% dplyr::filter(eyeDetection == 1) %>% dplyr::select(-eyeDetection, -ds), Fs = Fs, xlab="Eye Closed")

8 Do you see any differences between the power spectral densities for the two eye states? If so, describe them.

Independent Component Analysis

We may also wish to explore whether there are multiple sources of neuronal activity being picked up by the sensors.
This can be achieved using a process known as independent component analysis (ICA) which decorrelates the channels and identifies the primary sources of signal within the decorrelated matrix.

ica <- eegkit::eegica(eeg_train %>% dplyr::select(-eyeDetection, -ds), nc=3, method='fast', type='time')
mix <- dplyr::as_tibble(ica$M)
## Warning: The `x` argument of `as_tibble.matrix()` must have unique column names if `.name_repair` is omitted as of tibble 2.0.0.
## Using compatibility `.name_repair`.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was generated.
mix$eyeDetection <- eeg_train$eyeDetection
mix$ds <- eeg_train$ds

mix_melt <- reshape2::melt(mix, id.vars=c("eyeDetection", "ds"), variable.name = "Independent Component", value.name = "M")


ggplot2::ggplot(mix_melt, ggplot2::aes(x=ds, y=M, color=`Independent Component`)) + 
  ggplot2::geom_line() + 
  ggplot2::geom_vline(ggplot2::aes(xintercept=ds), data=dplyr::filter(mix_melt, eyeDetection==1), alpha=0.005) +
  ggplot2::scale_y_log10()
## Warning in self$trans$transform(x): NaNs produced
## Warning: Transformation introduced infinite values in continuous y-axis
## Warning: Removed 8987 row(s) containing missing values (geom_path).

9 Does this suggest eye activate forms an independent component of activity across the electrodes?

Eye Opening Prediction

Now that we’ve explored the data let’s use a simple model to see how well we can predict eye status from the EEGs:

model <- h2o::h2o.gbm(x = colnames(dplyr::select(eeg_train, -eyeDetection, -ds)), 
                      y = colnames(dplyr::select(eeg_train, eyeDetection)),
                      training_frame = h2o::as.h2o(eeg_train),
                      validation_frame = eeg_validate,
                      distribution = "bernoulli",
                      ntrees = 100,
                      max_depth = 4,
                      learn_rate = 0.1)
## 
  |                                                                            
  |                                                                      |   0%
  |                                                                            
  |======================================================================| 100%
## 
  |                                                                            
  |                                                                      |   0%
  |                                                                            
  |==================                                                    |  25%
  |                                                                            
  |===========================================                           |  62%
  |                                                                            
  |======================================================================| 100%
print(model)
## Model Details:
## ==============
## 
## H2OBinomialModel: gbm
## Model ID:  GBM_model_R_1655040503280_500 
## Model Summary: 
##   number_of_trees number_of_internal_trees model_size_in_bytes min_depth
## 1             100                      100               24848         4
##   max_depth mean_depth min_leaves max_leaves mean_leaves
## 1         4    4.00000         12         16    15.17000
## 
## 
## H2OBinomialMetrics: gbm
## ** Reported on training data. **
## 
## MSE:  0.1076065
## RMSE:  0.3280343
## LogLoss:  0.3600893
## Mean Per-Class Error:  0.1302043
## AUC:  0.9464602
## AUCPR:  0.9406235
## Gini:  0.8929204
## R^2:  0.5657448
## 
## Confusion Matrix (vertical: actual; across: predicted) for F1-optimal threshold:
##           0    1    Error        Rate
## 0      4295  621 0.126322   =621/4916
## 1       546 3526 0.134086   =546/4072
## Totals 4841 4147 0.129840  =1167/8988
## 
## Maximum Metrics: Maximum metrics at their respective thresholds
##                         metric threshold       value idx
## 1                       max f1  0.454543    0.858012 207
## 2                       max f2  0.303734    0.899849 270
## 3                 max f0point5  0.582481    0.882476 160
## 4                 max accuracy  0.464421    0.870717 203
## 5                max precision  0.990029    1.000000   0
## 6                   max recall  0.063184    1.000000 381
## 7              max specificity  0.990029    1.000000   0
## 8             max absolute_mcc  0.462354    0.739213 204
## 9   max min_per_class_accuracy  0.449299    0.868861 209
## 10 max mean_per_class_accuracy  0.454543    0.869796 207
## 11                     max tns  0.990029 4916.000000   0
## 12                     max fns  0.990029 4071.000000   0
## 13                     max fps  0.014653 4916.000000 399
## 14                     max tps  0.063184 4072.000000 381
## 15                     max tnr  0.990029    1.000000   0
## 16                     max fnr  0.990029    0.999754   0
## 17                     max fpr  0.014653    1.000000 399
## 18                     max tpr  0.063184    1.000000 381
## 
## Gains/Lift Table: Extract with `h2o.gainsLift(<model>, <data>)` or `h2o.gainsLift(<model>, valid=<T/F>, xval=<T/F>)`
## H2OBinomialMetrics: gbm
## ** Reported on validation data. **
## 
## MSE:  0.1200593
## RMSE:  0.3464957
## LogLoss:  0.3894168
## Mean Per-Class Error:  0.1585421
## AUC:  0.9239359
## AUCPR:  0.9173075
## Gini:  0.8478718
## R^2:  0.5157124
## 
## Confusion Matrix (vertical: actual; across: predicted) for F1-optimal threshold:
##           0    1    Error       Rate
## 0      1328  307 0.187768  =307/1635
## 1       176 1185 0.129317  =176/1361
## Totals 1504 1492 0.161215  =483/2996
## 
## Maximum Metrics: Maximum metrics at their respective thresholds
##                         metric threshold       value idx
## 1                       max f1  0.425963    0.830705 225
## 2                       max f2  0.317548    0.887594 268
## 3                 max f0point5  0.606576    0.850985 152
## 4                 max accuracy  0.516521    0.846462 189
## 5                max precision  0.978424    1.000000   0
## 6                   max recall  0.084742    1.000000 373
## 7              max specificity  0.978424    1.000000   0
## 8             max absolute_mcc  0.479757    0.690191 204
## 9   max min_per_class_accuracy  0.458183    0.839089 213
## 10 max mean_per_class_accuracy  0.479757    0.844921 204
## 11                     max tns  0.978424 1635.000000   0
## 12                     max fns  0.978424 1358.000000   0
## 13                     max fps  0.012874 1635.000000 399
## 14                     max tps  0.084742 1361.000000 373
## 15                     max tnr  0.978424    1.000000   0
## 16                     max fnr  0.978424    0.997796   0
## 17                     max fpr  0.012874    1.000000 399
## 18                     max tpr  0.084742    1.000000 373
## 
## Gains/Lift Table: Extract with `h2o.gainsLift(<model>, <data>)` or `h2o.gainsLift(<model>, valid=<T/F>, xval=<T/F>)`

10 What validation performance can you get with h2o::h2o.xgboost instead?

11 Using the best performing of the two models calculate the test performance

#perf <- h2o::h2o.performance(model = , newdata = h2o::as.h2o(eeg_test))
#print(perf)

12 Describe 2 possible alternative modelling approaches we discussed in class but haven’t explored in this notebook.